Machine learning to predict unintended pregnancy among reproductive-age women in Ethiopia: evidence from EDHS 2016

Background An unintended pregnancy is a pregnancy that is either unwanted or mistimed, such as when it occurs earlier than desired. It is one of the most important issues the public health system is currently facing, and it comes at a significant cost to society both economically and socially. The burden of an undesired pregnancy still weighs heavily on Ethiopia. The purpose of this study was to assess the effectiveness of machine learning algorithms in predicting unintended pregnancy in Ethiopia and to identify the key predictors. Method Machine learning techniques were used in the study to analyze secondary data from the 2016 Ethiopian Demographic and Health Survey. To predict and identify significant determinants of unintended pregnancy using Python software, six machine-learning algorithms were applied to a total sample of 7193 women. The top unplanned pregnancy predictors were chosen using the feature importance technique. The effectiveness of such models was evaluated using sensitivity, specificity, accuracy, and area under the curve. Result The ExtraTrees classifier was chosen as the top machine learning model after various performance evaluations. The region, the ideal number of children, religion, wealth index, age at first sex, husband education, refusal sex, total births, age at first birth, and mother’s educational status are identified as contributing factors in that predict unintended pregnancy. Conclusion The ExtraTrees machine learning model has a better predictive performance for identifying predictors of unintended pregnancies among the chosen algorithms and could improve with better policy decision-making in this area. Using these important features to help direct appropriate policy can significantly increase the chances of mother survival.


Background
An unintended pregnancy is a pregnancy that is either unwanted or mistimed, such as when it occurs earlier than desired.It is one of the most important issues the public health system is currently facing, and it comes at a significant cost to society both economically and socially.It results in decreased workforce productivity and quality of life [1,2].Between 2015 and 2019, there were 121 million unintended pregnancies worldwide.Every year, 61% of pregnancies result in abortions.Although unintended births have decreased globally, there has been an uneven distribution between high-income and low-income nations.The prevalence rate in high-income countries is 66 per 1000 pregnancies however; it was 93 per 100 women in middle and low-income countries.The burden of an undesired pregnancy still weighs heavily on Ethiopia despite the availability of broad family planning services.According to a systematic review conducted in Ethiopia, the overall prevalence of unintended pregnancy was 28%.Also, results from the 2016 Ethiopian Demographic and Health Survey (EDHS) support this finding, which indicates that 25% of all births in the previous five years and all ongoing pregnancies were unintended [3][4][5].
Globally, unintended pregnancies have a variety of detrimental effects on mothers and fetuses.One of the most common negative consequences of unintended pregnancy is induced abortion with its complications.Six out of 10 of all unintended pregnancies end in induced abortion.People with unintended pregnancies frequently turn to unsafe abortion when they encounter obstacles to obtaining a safe, quick, inexpensive, geographically accessible, respectful, and non-discriminatory abortion [6,7].The child of an unintended pregnancy is more likely to be maltreated, to be born weighing less than 2,500 g, to die within the first year of life, and to lack the resources necessary for healthy growth.It also affects mothers by making the relationship with their spouse more likely to end in divorce, and she may be more likely to experience physical violence herself.The mother and father can experience financial difficulty and fall short of their aspirations for their careers and education [8].
According to different studies, the prevalence of unintended pregnancies was highest among women who were between the ages of 18 and 24 years, had never used family planning methods, had low income (less than 100% of the federal poverty level), had not completed high school, had a birth interval of fewer than two years, was living in rural areas, was pregnant only by their husband's decision, had gravidity greater than or equal to five, was non-Hispanic black or African American, and was cohabiting but had never married [6][7][8][9][10][11][12].
The following recommendations are made to help reduce unintended pregnancies: increasing access to contraception; raising awareness of the importance of feelings, attitudes, and motivation in using contraception and preventing unintended pregnancies; developing and meticulously evaluating a variety of local programs, and encouraging research to create new contraceptives.The new guideline of abortion care also recommends straightforward primary care interventions that involve assuring access to medical abortion pills, ensuring that correct care information is available to all those who need it, improving the quality of abortion care delivered to women and girls, and task-sharing by a wider range of health providers [8,13].
The potential causes of unintended pregnancy have been the subject of numerous research investigations using traditional statistical analysis methods [6,[14][15][16][17].Nevertheless, no prior studies have attempted to use machine learning to predict unintended pregnancy and identify predictive factors.As a result, when the number of input variables and potential correlations rises, previously employed statistical procedures become less accurate, producing incorrect conclusions [18].Machine learning was used more effectively [19] and machine learning methods are a good solution to these issues because they can capture complicated and nonlinear correlations in the data, improving prediction accuracy above traditional regression models.So, the purpose of this work was to use the most advanced machine learning models to predict unintended pregnancy and identify its predictors.

Data source and population
This study relied on the 2016 Ethiopian Demographic and Health Survey (EDHS), a nationally representative survey that was conducted from January 18 to June 27, 2016.The survey's sample was divided into two groups and then selected in two stages.A total of 645 EAs were chosen in the first stage, with the chance of selection inversely correlated with the size of the EA (202 in urban regions and 443 in rural areas).28 households per cluster were chosen in the second stage by a methodical process with an equal probability.A comprehensive amount of data was gathered from 16,650 households, 15,683 female respondents, and 12,688 male respondents on topics such as adult and childhood morbidity and mortality, awareness and attitudes toward HIV/AIDS, and other significant public health issues.These topics included fertility and fertility preference, marriage, awareness and use of family planning methods, as well as issues related to reproductive health [4].A total weighted sample of 7590 women (15-49 years old) of reproductive age who had birth within the five years before the survey was used.

Study features
The dependent variable was unintended pregnancy, which encompasses unintended or later-wanted pregnancies.The independent features for this study were the maternal age, maternal occupation, marital status, religion, region parity, household size, wealth index, Husband occupational status, Husband education, residence past miscarriages, knowledge of the ovulation cycle, and distance from the health facility, Ideal number of children, age at first sex, refusal sex, total birth, and age at first birth.To make important independent variables appropriate for analysis, they were recorded or categorized.

Data processing and analysis
A high-quality dataset is required for machine learning to make predictions.As a result, managing the missing data during the dataset's pre-processing is an essential step.Encoding data is a fundamental and necessary procedure that is included in data pre-processing.Categorical variables were encoded using one-hot and label encoding.Values that fall into two or more categories and are discrete rather than continuous are said to be categorical.One hot encoding and label encoding technique were used in this work to encode categorical variables [20].

Data analysis
In this study, descriptive statistics were used to describe the socio-demographic characteristics using frequency and percentage.Data analysis stages included preprocessing the data, feature selection, data splitting, addressing imbalanced data, model building, and model performance testing.Python version 3 was the tool used in this study.

Feature selection method
The goal of feature selection is to rank and prioritize the most important predictors in the dataset.This is determined by computing the information gain values for each of the selected variables.To find the major factors that significantly result in unintended pregnancy, we used a decision tree classifier, extra trees classifier, XGBoost classifier, gradient boosting classifier, and a random forest model in this work.The higher information gain values indicate significant variables and their class have strong associations.The top ten information values were chosen at random.It is a relatively effective method for reducing model complexity and accelerating the processing of machine learning algorithms [21].

Data split
For machine learning approaches, the dataset is randomly divided into two parts: one is a training dataset that trains the model, and the second is a test dataset that predicts the response variable and sees if the predicted outcome is similar to the actual outcomes.The validation dataset is also taken into consideration for the parameter estimates to be incorporated into the training models [22].However, The complete dataset for this study was divided into ten folds using the stratified tenfold crossvalidation approach.

Imbalance data handling
The effectiveness of machine learning algorithms is frequently assessed using predictive accuracy, however, due to the imbalance in the data, it is challenging to identify the root cause of unintended pregnancy.To balance the majority and minority classes in this study, the Synthetic Minority Oversampling Technique (SMOTE) [23] was employed.SMOTE is a pre-processing method for learning algorithms that effectively handles class imbalance by oversampling imbalanced datasets.By linearly overlaying at random between a few samples and their neighbors, it generates a new sample [24].

Method of building a predictive model
The most effective models were picked to do the training after the data was arranged and split into training and testing samples.To produce a prediction, it was necessary to select the appropriate classifiers for the result variable's categorical nature, which made the challenge a classification task.In this work, six supervised classification methods were employed.The ExtraTrees classifier, Random Forest, Decision Tree, Logistic Regression, Gradient Boosting, and XGBoost were used for this study.The algorithms were chosen for their accuracy, training time, ability to handle missing data, and ease of understanding and learning.

Performance evaluation for predictive model
Following model training, each model's performances are assessed and contrasted with one another.Based on the confusion matrix, the prediction models' performance was assessed.Precision, sensitivity, specificity, F1-score, and area under the receiver-operating characteristic (AUC-ROC) were utilized in this study to evaluate the model's performance.
The confusion matrix is a common performance measuring tool used in machine learning classification tasks and is used to describe a model's output as a binary class [25].The performance of ML models was also visualized using the ROC curve (or receiver operating characteristic curve) (Table 1).
According to the confusion matrix above, the following lists recall (sensitivity), (specificity), precision, and accuracy were derived In summary, Fig. 1 shows the machine learning process used in this study.

Sociodemographic characteristics of participants
From the total number of reproductive-age women who had unintended, this study includes 7589 women who were reproductive age.69.36% (5264) were between the ages of 20 and 34.92.58% (7026) women were married and 79% (6042) lived in a rural area and 41.24% (3129) lived in Oromia.More than half of the women had not educated, and 33.10% (2512) of all women were orthodox (Table 2).

Imbalance data handling
Unbalanced data handling was a key strategy for this study to handle the problem of unbalanced data and boost the performance of the machine learning algorithms.An imbalanced dataset was balanced using the SMOTE sampling method, and the accuracy and AUC based on the chosen machine learning algorithms were compared for the balanced and unbalanced datasets.
When compared to another classifier in the unbalanced dataset, gradient boosting performed better with an AUC of 0.682, while logistic regression had a higher AUC of 0.668.The Extra tree classifier in the SMOTE has a higher accuracy of 84.93% and an AUC of 0.926.Moreover, the Random forest also outperformed next to the Extratrees classifier on a balanced dataset, with the test accuracy and AUC values of 84.40 and 0.924, respectively (Table 3).Machine learning is difficult with unbalanced data because values from the minority class or rarely occurring classes are wrongly categorized as instances of the majority class, which lowers the performance of the classifying algorithm.After all, the classifier is overwhelmed by the dominant class and ignores the unintended class, which is the minority class.After SMOTE was applied to the unbalanced dataset, the overall number of records Fig. 1 Workflow of machine learning for unintended pregnancy prediction rose.(Fig. 2).We mainly used AUC to compare the classifier and balanced sampling method.

Implementation of unintended pregnancy prediction models
In this study, the data were split into training and test sets, which together made up 90% and 10% of the total data.The model performance, including prediction evaluation metrics, can be evaluated in comparison to various machine learning classifiers.To avoid overfitting, the popular 10-fold cross-validation method was applied to this study.The experiments were mainly divided into two sections: the first section trained the different classification algorithms using 32 features from an imbalanced dataset, and the second section employed a balanced sampling strategy to determine which model with 32 features was the best.High accuracy, precision, sensitivity, specificity, f1-score, and AUC were obtained by applying various machine learning classification algorithms like Logistic regression, decision tree, random forest, gradient boosting, XGBoost, and ExtraTrees) to the balanced data using SMOTE.In comparison to other algorithms, ExtraTrees produces better accuracy and results in performance metrics.The ExtraTrees classifier (AUC = 0.928) outperforms all other classifiers in terms of performance metrics and is the best in foretelling unintended pregnancies, as shown by the ROC curve in Fig. 3. Alongside the ExtraTrees classifier, the performance of random forest (AUC = 0.924), XGBoost (0.898), logistic regression (AUC = 0.775), XGBoost (0.898), gradient boosting classifier (AUC = 0.824), and decision tree classifier (AUC = 0.76) was also impressive (Fig. 3).

Extratrees classifier performance
From the balanced dataset, the ExtraTrees classifier's performance was quite strong compared to other selected   classifiers.The hyper-parameter tuning and feature selection were carried out after the best model had been chosen.The important predictor of unintended pregnancy was established to compare the model's performance.

Tuning an ExtraTrees classifier with grid serach CV
After selecting the best model, this study applied the hyperparameter tuning to compare it with the default hyperparameter tuning.Figure 4 shows that default hyperparameter tuning was higher performed than hyperparameter tuning using the best classifier of the ExtraTrees Model.According to the results, the Extra-Trees Model classifier with tuned hyperparameters was less performed than the ExtraTrees classifier with the default hyperparameter.Therefore, this study used the ExtraTrees classifier with a default hyperparameter with the tuning of sensitivity, specificity, precision, and f1-score of 83.79%, 83.94%, 83.94%, and 84.04%, respectively.The ExtraTrees Model classifier with default hyperparameter tuning had the highest AUC value, which means that the classifier properly identified unintended or unplanned.Then this study applied the default hyperparameter tuning (Fig. 4).

Top features from the chosen classifier
This experiment was performed to examine the classifier's ability to predict unintended pregnancy and the impact of feature selection.Based on all chosen classifiers, this study identified the features that predict unintended pregnancy to determine which features were the best predictors.The cumulative result of the classifier feature importance was chosen as the suitable way to identify the features that most reliably predict unintended pregnancy for this study using these findings as a guide.The region, the Ideal number of children, religion, wealth index, age at first sex, husband education, refusal sex, total birth, age at first birth, and Mother's Educational Status were the factors that had the greatest impact on unintended pregnancy out of all the predictors.Table 4 shows that from the chosen classifier, the top ten features were selected using the median results (Table 4).

ExtraTrees classifier features importance
Relevant features selected by an ExtraTrees classifier show that at the bottom were identified as the top predictors of unintended pregnancy.Of all features, region, ideal number of children, religion wealth index, age at first sex, husband education, household size, refusal sex, total birth, and decision on marriage were top predictors (Fig. 5).

Discussion
According to earlier research on this topic, Ethiopia has one of the highest rates of unintended pregnancies worldwide [26][27][28][29].Findings also revealed that while the prevalence of unwanted pregnancies has occasionally declined Fig. 4 Comparison of tuned and default hyperparameter using ExtraTrees classifier in the nation, more work is still needed to support this pattern and manage the phenomenon's undesirable repercussions.Machine learning models are regarded as state-of-the-art approaches and techniques for quick and accurate problem-solving.This study has aimed to predict and identify the predictors of unintended pregnancy and build the best performance of a machine learning classifier.Six machine learning algorithms such as Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, XGBoost, and ExtraTrees, were applied to predict unintended pregnancy in Ethiopia using EDHS 2016 data.The above models were chosen to build and evaluate the best predictive model using the key predictors, which will increase model prediction accuracy and generalizability.stratified 10-fold cross-validation has been used to train the classifiers on a set of training data.To determine the optimal accuracy, several tests were conducted applying both balanced and unbalanced datasets.The outcome demonstrated that imbalanced data produced low-performance metrics.To balance the unbalanced data, this  study used the SMOTE balancing sampling approach.AUC, recall, precision, and accuracy performance evaluations showed that the ExtraTrees classifier performed better than all other selected classifiers (84.75%, 84.66%, 84.81%, and (0.925)., respectively).As a result, this classifier was selected in our study for the prediction of unintended pregnancy.
Using the relevance values of independent features for the ExtraTrees classifier, this study identified the key influencing factors for unintended pregnancy.The most significant variables that contribute to higher performance in unintended pregnancy prediction were found using the average results of all classifiers used in the feature selection process.The important predictors of unintended pregnancy among all independent characteristics included region, the ideal number of children, religion, wealth index, age at first sex, husband education, refusal sex, total birth, age at first birth, and mother's educational status.
The region was a very important predictor of unintended pregnancy.This research supports a systematic review and meta-analysis of an Ethiopian observational study [6], as well as a study using EDHS 2016 data that found that different regions of Ethiopia [14] had higher rates of unintended pregnancies.The sociodemographic variations between the individuals in each region could be one of the causes.
The machine learning classifier identified that the wealth index was a highly important feature for predicting unintended pregnancies.This finding is supported by research conducted in India [30], Bangladesh [31], Iran [32], Nepal [33], Nigeria [34], Kenya [35], and across different parts of Ethiopia [36][37][38], revealed that women who have high wealth status are more empowered to take charge of their sexual and reproductive health matters than women who have poorer wealth status.The relationship between income status, occupation, and unintended pregnancies may be explained by the connection between formal employment and social networks, and earning potential [39].
Religion was an important feature in predicting unintended pregnancy.previous studies conducted in Bangladesh [31], Nepal [40], Addis Zemen, and Ethiopia [41], and a study using EDHS 2016 data in Ethiopia [14] revealed that Women who had a religion tend to be highly associated with unintended pregnancy.The possible explanations for the association include the fact that women believe that every child is a gift from God and that their religion discourages the use of contraception.Thus, mothers who follow a particular faith do not think that unintended pregnancy occurs.
Based on the finding of this study, the husband's education and the mother's educational status of the respondent were found other relevant features for predicting unintended pregnancy.This finding is supported by research conducted in Russia [42], Bangladesh [43], Uganda [44], Malawi [45], and Southern Ethiopia [46], which reported that husbands and mothers who had educational status were more likely associated with unintended pregnancy.The possible explanation might be due.
Other relevant features of unintended pregnancy were the Ideal number of children, age at first sex, refusal of sex, total birth, and age at first birth.
In findings, our study shows that machine learning techniques can be used to identify predictive characteristics related to unwanted pregnancy.Machine learning methods appear to be useful for determining which indicators are most important for predicting an unplanned pregnancy.Our study model might help with the crucial public health problem of identifying and managing unintended pregnancies.
For predicting unintended or unplanned pregnancies, the suggested method has the best ROC, accuracy, precision, recall, and specificity.This prediction is in support of providing women with comprehensive services and extended working hours.Effective predictive modeling may raise medical care standards and increase maternal survival.Therefore, the prediction models of unintended pregnancy developed in our work can significantly contribute by detecting women with undesired or unintended pregnancies and adopting the most effective supportive measures, such as offering training or any other form of information transmission.This might reduce misunderstanding by providing quantitative, unbiased, and research-based models for risk classification, prediction, and ultimately care planning.This work cannot be considered complete without its limitations.In contrast to the statistical model, the machine learning model's result lacks a coefficient and odds ratio, making it challenging to determine how much and in which direction various factors impact the final result.In addition, Machine algorithms are primarily less interpretable because they lack parameters and typically identify or anticipate particular variables according to how significant a part they played in the current study's determination of the unwanted pregnancy.

Conclusion
In predicting unintended pregnancy factors in Ethiopia, the ExtraTrees classifier has a somewhat higher predictive ability than other selected machine learning classifiers.By using the ExtraTrees classifier to choose the desired features related to unintended pregnancy, we found that region, the ideal number of children, religion, wealth index, age at first sex, husband education, refusal sex, total birth, age at first birth, and mother educational status were the significant predictors of unintended pregnancy.This work emphasizes the use of machine learning algorithms to predict and better comprehend top significant unintended pregnancy predictor variables to improve essential policy directions.

Fig. 5
Fig. 5 Relevant features selected by an ExtraTrees feature importance

Table 3
Compares imbalanced data handling techniques using accuracy and Area under the curve (AUC) SMOTE: Synthetic Minority Over-sampling Technique, AUC: Area Under Curve, Underline and bold numbers were the highest score of the classifier

Table 4
Compares selected machine learning models in choosing the top features RF: Random Forest; ExtraTrees; DT: Decision Tree; LR: Logistic Regression; GB: Gradient Boost; XGBoost: Extreme Gradient Boosting